Use NTP admin API instead of sled agent, remove sled agent time_sync API #8597

smklein · 2025-07-15T00:21:13Z

Builds on #8555

This PR:

- Avoids the sled agent "zlogin to NTP zone" calls to chronyc, instead relying on the NTP admin server
- Removes the "timesync_get" API exposed from the sled agent, and moves all callers (RSS only) to call the NTP admin server instead

smklein · 2025-07-15T21:56:23Z

FYI, I ran this on a4x2 - it didn't work, until I fixed an SMF typo, now it works. This is tested implicitly by the "helios / deploy" job, which relies on RSS, and also relies on the timesync API in the NTP admin server.

karencfv

Looks great, thanks!

karencfv · 2025-07-15T23:45:03Z

openapi/sled-agent.json

@@ -1464,29 +1464,6 @@
        }
      }
    },
-    "/timesync": {


I wonder if there's any documentation that relies on checking the response from this endpoint in the logs to debug an NTP sync issue or something similar. I have a vague memory of something like that. If anyone remembers, that documentation should be updated. Perhaps an announcement on Matrix might be good as well? I've certainly used it before if an omicron deployment was taking too long to start.

jgallagher

LGTM, thanks!

> FYI, I ran this on a4x2 - it didn't work, until I fixed an SMF typo, now it works. This is tested implicitly by the "helios / deploy" job, which relies on RSS, and also relies on the timesync API in the NTP admin server.

I don't see an SMF change in this PR - was it something a4x2-specific outside omicron?

jgallagher · 2025-07-16T15:11:13Z

sled-agent/src/rack_setup/service.rs

+        let ntp_addresses: Vec<_> = service_plan
+            .services
+            .iter()
+            .flat_map(|(_, sled_config)| {


I'm kinda tempted to suggest we do something here to confirm we have exactly one NTP zone per sled (probably via asserting, since we came up with the plan and it should always be correct?). Maybe that would be overkill, since 0 or more-than-one NTP zone on a sled is going to cause problems pretty quickly anyway? Something like (using exactly_one() from itertools):

    let ntp_addresses: Vec<_> = service_plan
        .services
        .iter()
        .map(|(_, sled_config)| {
            sled_config.zones.iter().filter_map(|zone_config| {
                // ...extract ntp_admin_addr...
            })
            .exactly_one()
            .expect("plan should specify one NTP zone per sled")
        })
        .collect();
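To make the suggested check concrete, here is a minimal self-contained sketch. It is not the omicron code. `ZoneKind` is a hypothetical stand-in for the blueprint zone types, and the hand-rolled `exactly_one` helper is an assumption that stands in for the itertools method so the example needs only the standard library.

```rust
use std::net::SocketAddrV6;

/// Hypothetical stand-in for the real blueprint zone types; only the
/// "is this an NTP zone, and at what address" shape matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneKind {
    Ntp(SocketAddrV6),
    Other,
}

/// Returns the single item of `iter`, or `None` if it yields zero items
/// or more than one. A std-only stand-in for itertools' `exactly_one()`.
pub fn exactly_one<I: Iterator>(mut iter: I) -> Option<I::Item> {
    let first = iter.next()?;
    if iter.next().is_some() {
        None
    } else {
        Some(first)
    }
}

/// Collects one NTP address per sled. Panics if any sled has zero or
/// several NTP zones, since the plan was generated locally and a
/// violation would indicate a bug in plan generation.
pub fn ntp_addresses(sleds: &[Vec<ZoneKind>]) -> Vec<SocketAddrV6> {
    sleds
        .iter()
        .map(|zones| {
            exactly_one(zones.iter().filter_map(|zone| match zone {
                ZoneKind::Ntp(addr) => Some(*addr),
                ZoneKind::Other => None,
            }))
            .expect("plan should specify one NTP zone per sled")
        })
        .collect()
}
```

Compared with `flat_map`, which silently accepts any number of NTP zones per sled, the `map` plus `exactly_one` shape fails loudly at the point where the invariant is broken.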

jgallagher · 2025-07-16T15:12:18Z

sled-agent/src/rack_setup/service.rs

+                        BlueprintZoneType::BoundaryNtp(
+                            blueprint_zone_type::BoundaryNtp {
+                                address, ..
+                            },
+                        ) => {
+                            let mut ntp_admin_addr = *address;
+                            ntp_admin_addr.set_port(NTP_ADMIN_PORT);
+                            Some(ntp_admin_addr)
+                        }
+                        BlueprintZoneType::InternalNtp(
+                            blueprint_zone_type::InternalNtp { address },
+                        ) => {
+                            let mut ntp_admin_addr = *address;
+                            ntp_admin_addr.set_port(NTP_ADMIN_PORT);
+                            Some(ntp_admin_addr)
+                        }


Since both of these arms bind a field named address with the same type, I think you can combine them?

    BlueprintZoneType::BoundaryNtp(
        blueprint_zone_type::BoundaryNtp { address, .. },
    )
    | BlueprintZoneType::InternalNtp(
        blueprint_zone_type::InternalNtp { address },
    ) => {
        let mut ntp_admin_addr = *address;
        ntp_admin_addr.set_port(NTP_ADMIN_PORT);
        Some(ntp_admin_addr)
    }
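For anyone who wants to try the combined arm in isolation, here is a small standalone sketch. The `ZoneType` enum and its `upstream_count` field are hypothetical simplifications of the blueprint types, and the value given to `NTP_ADMIN_PORT` is a placeholder, not the real constant from omicron.

```rust
use std::net::SocketAddrV6;

// Placeholder value for illustration only; the real constant lives in omicron.
pub const NTP_ADMIN_PORT: u16 = 10123;

/// Hypothetical, simplified stand-in for `BlueprintZoneType`.
pub enum ZoneType {
    BoundaryNtp { address: SocketAddrV6, upstream_count: usize },
    InternalNtp { address: SocketAddrV6 },
    Other,
}

/// Returns the NTP admin address for NTP zones, `None` for anything else.
pub fn ntp_admin_addr(zone: &ZoneType) -> Option<SocketAddrV6> {
    match zone {
        // An or-pattern is allowed because both alternatives bind
        // `address` with the same type (`&SocketAddrV6`).
        ZoneType::BoundaryNtp { address, .. }
        | ZoneType::InternalNtp { address } => {
            let mut ntp_admin_addr = *address;
            ntp_admin_addr.set_port(NTP_ADMIN_PORT);
            Some(ntp_admin_addr)
        }
        ZoneType::Other => None,
    }
}
```

The compiler rejects an or-pattern whose alternatives bind different names or different types, so this only works as long as both variants keep an `address` field of the same type.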

jgallagher · 2025-07-16T15:18:03Z

sled-agent/src/rack_setup/service.rs

@@ -724,17 +728,20 @@ impl ServiceInner {

    async fn wait_for_timesync(
        &self,
-        sled_addresses: &Vec<SocketAddrV6>,
+        ntp_admin_addresses: &Vec<SocketAddrV6>,


I realize this is consistent with what was already here, but: by accepting a list of socket addrs, we end up rebuilding an NtpAdminClient for every sled inside every attempt of the retry_notify loop. Maybe we should take a &[NtpAdminClient] instead of a list of socket addrs, so we only create the clients once? (I have a vague memory of reqwest clients being relatively expensive to construct, but maybe we solved that some other way?)
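A rough sketch of the "build the clients once, pass a slice" shape follows. It is deliberately synchronous and std-only. `FakeNtpAdminClient`, its `timesync` method, and the `max_attempts` parameter are all hypothetical stand-ins; the real code is async, uses `retry_notify`, and talks to the NTP admin server.

```rust
use std::cell::Cell;
use std::net::SocketAddrV6;

/// Hypothetical stand-in for the real NTP admin client. It reports
/// "synced" once it has been polled `polls_until_synced` times.
pub struct FakeNtpAdminClient {
    addr: SocketAddrV6,
    polls: Cell<usize>,
    polls_until_synced: usize,
}

impl FakeNtpAdminClient {
    pub fn new(addr: SocketAddrV6, polls_until_synced: usize) -> Self {
        Self { addr, polls: Cell::new(0), polls_until_synced }
    }

    pub fn addr(&self) -> SocketAddrV6 {
        self.addr
    }

    /// Pretend timesync query: counts the poll and reports sync status.
    pub fn timesync(&self) -> bool {
        let n = self.polls.get() + 1;
        self.polls.set(n);
        n >= self.polls_until_synced
    }
}

/// Polls every client once per attempt and returns the 1-based attempt
/// on which all of them reported sync, or `None` if attempts ran out.
/// The clients are borrowed, so nothing is constructed inside the loop.
pub fn wait_for_timesync(
    clients: &[FakeNtpAdminClient],
    max_attempts: usize,
) -> Option<usize> {
    for attempt in 1..=max_attempts {
        // `filter(..).count()` polls all clients without short-circuiting.
        let synced = clients.iter().filter(|c| c.timesync()).count();
        if synced == clients.len() {
            return Some(attempt);
        }
    }
    None
}
```

The caller would map the socket addresses to clients a single time before entering the retry loop, so the per-attempt cost is only the requests themselves.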

jgallagher · 2025-07-16T15:18:43Z

sled-agent/config-reconciler/src/reconciler_task/zones.rs

-
-        let stdout = running_ntp_zone
-            .run_cmd(&["/usr/bin/chronyc", "-c", "tracking"])
-            .map_err(TimeSyncError::ExecuteChronyc)?;


🎉 Love getting rid of this.

smklein · 2025-07-16T17:49:31Z

> LGTM, thanks!
>
> > FYI, I ran this on a4x2 - it didn't work, until I fixed an SMF typo, now it works. This is tested implicitly by the "helios / deploy" job, which relies on RSS, and also relies on the timesync API in the NTP admin server.
>
> I don't see an SMF change in this PR - was it something a4x2-specific outside omicron?

I pulled the SMF change into #8555

(<dependency name='oxide/ntp' causes the manifest importer to reject the whole service because, in that context, it decides "/" is a character it doesn't like. I had to use <dependency name='ntp' to make it happy.)

smklein force-pushed the ntp-admin-use branch 2 times, most recently from c4faf10 to 8f65850 Compare July 15, 2025 17:51

smklein force-pushed the ntp-admin branch from 8ad59a1 to 447ae1e Compare July 15, 2025 19:35

smklein force-pushed the ntp-admin-use branch from 8f65850 to bf49154 Compare July 15, 2025 19:35

smklein marked this pull request as ready for review July 15, 2025 19:53

smklein requested review from jgallagher and karencfv July 15, 2025 21:54

smklein mentioned this pull request Jul 15, 2025

Trying to collect timesync status in inventory #8603

Draft

karencfv approved these changes Jul 15, 2025


jgallagher approved these changes Jul 16, 2025


Base automatically changed from ntp-admin to main July 16, 2025 17:50

Use NTP admin API instead of sled agent, remove sled agent time_sync API

4f6de4b

smklein force-pushed the ntp-admin-use branch from bf49154 to 4f6de4b Compare July 16, 2025 22:16
